This exercise focuses on applying tidymodels to build predictive models for the Airbnb dataset from the first challenge. After data cleaning and transformation, the goal is to streamline model creation, evaluation, and interpretation using tidymodels, a modular and extensible framework that follows tidyverse principles. It provides a consistent interface for preprocessing, model specification, tuning, and evaluation, supporting various algorithms while integrating seamlessly with external libraries.
Importing Necessary R Libraries
The following R libraries are necessary for the data manipulation and modeling tasks:
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.4 ✔ readr 2.1.5
✔ forcats 1.0.0 ✔ stringr 1.5.1
✔ ggplot2 3.5.1 ✔ tibble 3.2.1
✔ lubridate 1.9.4 ✔ tidyr 1.3.1
✔ purrr 1.0.4
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(data.table)
Attaching package: 'data.table'
The following objects are masked from 'package:lubridate':
hour, isoweek, mday, minute, month, quarter, second, wday, week,
yday, year
The following objects are masked from 'package:dplyr':
between, first, last
The following object is masked from 'package:purrr':
transpose
library(corrplot)
corrplot 0.95 loaded
library(vip)
Attaching package: 'vip'
The following object is masked from 'package:utils':
vi
library(scales)
library(janitor)
Attaching package: 'janitor'
The following objects are masked from 'package:stats':
chisq.test, fisher.test
library(skimr)
library(future)
library(leaflet)
library(ggplot2)
library(htmltools)
library(htmlwidgets)
library(gt)
library(dials)
library(tidymodels)  # provides rsample, recipes, parsnip, tune, and yardstick used below

plan(multisession, workers = parallel::detectCores() - 1)

# Define helper function to extract feature importance from a resampled model
get_coefs <- function(model) {
  model_fit <- extract_fit_engine(model)
  xgboost::xgb.importance(model = model_fit)
}
Exercise 1: Split the dataset into training and test sets
We split the dataset into training (75%) and test (25%) sets for model development and evaluation. Setting set.seed(123) ensures reproducibility.
set.seed(123)  # ensures reproducibility of the random split
airbnb_split <- initial_split(airbnb_clean)  # default split (75% train, 25% test)
airbnb_train <- training(airbnb_split)
airbnb_test <- testing(airbnb_split)
airbnb_split
<Training/Testing/Total>
<14874/4959/19833>
Exercise 2: Create a 10-fold cross-validation object
A 10-fold cross-validation structure is created using vfold_cv(airbnb_train, v = 10), ensuring robust model evaluation by repeatedly training on nine subsets and validating on the tenth. This technique helps mitigate overfitting by providing multiple training-validation cycles, leading to a more generalized model. The seed is again set to 123 to ensure that the partitioning remains consistent across executions.
set.seed(123)
airbnb_folds <- vfold_cv(airbnb_train, v = 10,
                         strata = neighbourhood_group_cleansed)  # ensures balanced splits across categories
airbnb_folds
Exercise 3: Create a recipe, define metrics, and build a tidymodels workflow with XGBoost.
A preprocessing recipe (airbnb_recipe) is developed to prepare data for modeling by normalizing numerical predictors, applying one-hot encoding to categorical features, and removing date columns. The recipe also eliminates near-zero variance predictors and highly correlated numeric variables, ensuring the dataset is optimized for xgboost, which requires all input features to be numeric. A model workflow (airbnb_wf) is constructed, incorporating the preprocessing recipe, an xgboost regression model with tunable hyperparameters, and a predefined evaluation metric set (airbnb_metrics) to assess model performance.
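The original chunk for this step is not shown; the sketch below is one plausible implementation of the recipe, model specification, metric set, and workflow described above. The outcome name (log_price), correlation threshold, and the choice of metrics are assumptions.

```r
# Possible implementation of the preprocessing recipe (column names assumed)
airbnb_recipe <- recipe(log_price ~ ., data = airbnb_train) %>%
  step_rm(has_type("date")) %>%                             # remove date columns
  step_nzv(all_predictors()) %>%                            # drop near-zero variance predictors
  step_corr(all_numeric_predictors(), threshold = 0.9) %>%  # drop highly correlated numerics
  step_normalize(all_numeric_predictors()) %>%              # normalize numeric predictors
  step_dummy(all_nominal_predictors(), one_hot = TRUE)      # one-hot encode categoricals

# XGBoost regression model with tunable hyperparameters
xgb_spec <- boost_tree(
  trees = tune(), tree_depth = tune(), learn_rate = tune(),
  min_n = tune(), sample_size = tune()
) %>%
  set_engine("xgboost") %>%
  set_mode("regression")

# Evaluation metric set (four metrics, matching the 40-row CV output below)
airbnb_metrics <- metric_set(rmse, rsq, mae, mape)

# Combine recipe and model into a workflow
airbnb_wf <- workflow() %>%
  add_recipe(airbnb_recipe) %>%
  add_model(xgb_spec)
```

The tunable arguments declared here line up with the dials parameters used in the tuning grid of Exercise 4 (trees, learn_rate, tree_depth, min_n, sample_prop).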
Exercise 4: Tune hyperparameters using cross-validation and finalize the best model
A hyperparameter tuning process defines a search space for critical parameters, including tree depth, learning rate, and the number of trees, and samples candidate combinations with grid_random. The tuning is performed with tune_grid over the 10-fold cross-validation setup, optimizing for RMSE as the primary metric. The best combination of hyperparameters is selected using select_best(tune_results, metric = "rmse"), the workflow is finalized with finalize_workflow, and the finalized model is then trained on the training set and evaluated on the held-out test set with last_fit.
# Define tuning parameter ranges
xgb_params <- parameters(
  trees() %>% range_set(c(100, 500)),
  learn_rate() %>% range_set(c(0.01, 0.3)),
  tree_depth() %>% range_set(c(3, 6)),
  min_n() %>% range_set(c(5, 30)),
  sample_prop() %>% range_set(c(0.5, 1.0))
)

# Generate random tuning grid
xgb_grid <- grid_random(xgb_params, size = 10)

# Tune model
tune_results <- tune_grid(
  airbnb_wf,
  resamples = airbnb_folds,
  grid = xgb_grid,
  metrics = airbnb_metrics,
  control = control_grid(save_pred = TRUE)
)

# Select best model and finalize workflow
best_params <- select_best(tune_results, metric = "rmse")
final_wf <- finalize_workflow(airbnb_wf, best_params)

cv <- fit_resamples(
  object = final_wf,
  resamples = airbnb_folds,
  control = control_resamples(save_pred = TRUE),
  metrics = airbnb_metrics
)

# Collect the per-fold results (metrics and predictions)
cv_metrics <- pluck(cv, ".metrics") %>%
  map_dfr(.f = ~ rbind(.x)) %>%
  arrange(.metric)

# Fit final model on the training split and evaluate on the test split
final_fit <- last_fit(
  final_wf,
  split = airbnb_split,
  metrics = airbnb_metrics
)

# Inspect the cross-validation metrics
print(cv_metrics)
# A tibble: 40 × 4
.metric .estimator .estimate .config
<chr> <chr> <dbl> <chr>
1 mae standard 0.396 Preprocessor1_Model1
2 mae standard 0.395 Preprocessor1_Model1
3 mae standard 0.401 Preprocessor1_Model1
4 mae standard 0.396 Preprocessor1_Model1
5 mae standard 0.403 Preprocessor1_Model1
6 mae standard 0.396 Preprocessor1_Model1
7 mae standard 0.397 Preprocessor1_Model1
8 mae standard 0.417 Preprocessor1_Model1
9 mae standard 0.389 Preprocessor1_Model1
10 mae standard 0.397 Preprocessor1_Model1
# ℹ 30 more rows
Exercise 5: Extract final workflow and create airbnb_predictor for predictions
The trained model is extracted into airbnb_predictor, allowing for predictions on new data. This object encapsulates the final model after tuning and cross-validation, ensuring it can be used for real-world predictions without requiring reconfiguration. This step enables scalability, facilitating predictions on unseen listings while maintaining consistency with the trained model.
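The extraction step described above is not shown in the original; a minimal sketch, assuming the final_fit object produced by last_fit() in Exercise 4, is:

```r
# Pull the fitted workflow out of the last_fit() result so it can be
# used directly with predict() on new data
airbnb_predictor <- extract_workflow(final_fit)
```

Because the extracted object bundles the preprocessing recipe with the trained model, new listings can be passed to it in raw form without reapplying the recipe manually.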
Exercise 6: Predict on airbnb_clean and visualize accuracy
Predictions are generated for the entire dataset using predict(airbnb_predictor, new_data = airbnb_clean), and a scatter plot is created to visualize accuracy. The plot compares actual and predicted log prices, with a red dashed identity line (geom_abline(slope = 1, intercept = 0)) serving as a reference for perfect predictions. The model's effectiveness is reflected in how closely the points align with this line.
# Make predictions on full dataset
predictions <- predict(airbnb_predictor, new_data = airbnb_clean)
df_with_predictions <- bind_cols(airbnb_clean, predictions)

# Model evaluation ---------------------------------------------------
# Calculate prediction errors
df_with_predictions <- df_with_predictions %>%
  mutate(
    error = .pred - log_price,
    abs_error = abs(error),
    pct_error = abs_error / log_price * 100
  )

# Create accuracy visualization
ggplot(df_with_predictions, aes(x = log_price, y = .pred)) +
  geom_point(alpha = 0.5) +
  geom_abline(slope = 1, intercept = 0, color = "red", linetype = "dashed") +
  labs(
    title = "Actual vs Predicted Prices",
    x = "Actual Price",
    y = "Predicted Price"
  ) +
  theme_custom
In this exercise, we used our trained model to predict Airbnb prices for the entire dataset and plotted the results to evaluate model performance. The x-axis represents actual prices, while the y-axis represents predicted prices. The red dashed line represents a perfect prediction, meaning that all points should ideally align with it.
Interpretation of the Results
• The model generally follows the trend of actual prices, indicating that it captures overall price variations well.
• There is some dispersion, especially at higher price points, suggesting that the model struggles more with expensive listings.
• The presence of outliers indicates cases where the model significantly under- or over-predicts prices.
• The model appears to perform well for mid-range prices but shows increasing variance for high-price properties.
Exercise 7: Plot predictions on a geospatial map of Barcelona and analyze performance
A geospatial analysis of prediction errors is conducted using leaflet, mapping discrepancies between actual and predicted prices across Barcelona. The visualization represents each listing as a color-coded marker, where deviations from actual values are highlighted to identify areas of higher predictive variance. Additionally, separate neighborhood-level maps provide localized insights into model performance, helping assess pricing trends and potential systemic biases in different districts.
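The mapping code itself is not shown above; the sketch below illustrates one way to build the color-coded error map with a per-neighborhood layer filter, using the error thresholds stated later in the text (<10% green, 10-30% orange, >30% red). Column names follow df_with_predictions from Exercise 6.

```r
# Sketch of the leaflet error map (thresholds and columns assumed from the text)
m <- leaflet() %>%
  addProviderTiles(providers$CartoDB.Positron)

# Add each neighborhood as its own toggleable layer group
for (nb in unique(df_with_predictions$neighbourhood_group_cleansed)) {
  nb_data <- filter(df_with_predictions, neighbourhood_group_cleansed == nb)
  m <- m %>%
    addCircleMarkers(
      data = nb_data, lng = ~longitude, lat = ~latitude,
      color = ~if_else(pct_error < 10, "green",
                       if_else(pct_error <= 30, "orange", "red")),
      radius = 3, stroke = FALSE, fillOpacity = 0.6,
      group = nb
    )
}

# Interactive checklist to filter neighborhoods in a single view
m %>%
  addLayersControl(
    overlayGroups = unique(df_with_predictions$neighbourhood_group_cleansed),
    options = layersControlOptions(collapsed = FALSE)
  )
```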
Neighborhoods Ranked Based on Largest Prediction Errors
Neighborhood Group     Mean Error   Max Error
nou barris             0.304        3.875
sarria-sant gervasi    0.286        2.681
les corts              0.282        2.496
sant marti             0.280        2.558
horta-guinardo         0.276        2.621
sants-montjuic         0.275        3.266
gracia                 0.274        4.721
sant andreu            0.273        1.771
ciutat vella           0.272        2.787
eixample               0.271        4.010
[Per-neighborhood error maps: sant marti, eixample, gracia, horta-guinardo, les corts, ciutat vella, sants-montjuic, sarria-sant gervasi, nou barris, sant andreu]
The Leaflet map visualizes the prediction errors of Airbnb pricing across different neighborhoods in Barcelona. The color-coded markers indicate the level of error: green for low error (<10%), orange for moderate error (10-30%), and red for high error (>30%). The overwhelming presence of green markers suggests that the model generally performs well, with most predictions closely matching actual prices. However, some scattered orange and red markers indicate areas where the model struggles, potentially due to local pricing anomalies or data limitations. The interactive legend allows for filtering by neighborhood, which can provide further insights into regional variations in prediction accuracy.
For the map, there is an interactive checklist option where users can filter by neighborhood. This allows for easy exploration of different districts without switching between multiple maps, as well as allowing users to check out multiple areas in one view.
Exercise 8: Select properties within a €3M budget and estimate investment recovery time
A real estate investment strategy is formulated by selecting properties within a €3 million budget using predicted Airbnb rental income and estimated neighborhood occupancy rates. The analysis is based on a standardized listing profile (two bedrooms, two beds, one bathroom, and perfect ratings), with airbnb_predictor estimating potential earnings. The break-even period is calculated by projecting revenue over time, allowing for a data-driven approach to property acquisition and financial return assessment.
# Source the function to generate property data
source("/Users/raissaangnged/Downloads/create_sample.R")

set.seed(123)
budget <- 3000000

# Given avg_prices dataset
avg_prices <- data.frame(
  neighbourhood = c("eixample", "ciutat vella", "sant marti", "sants-montjuic",
                    "sarria-sant gervasi", "nou barris", "horta-guinardo",
                    "gracia", "sant andreu", "les corts"),
  avg_price = c(684012, 392645, 435215, 299140, 980439, 201074, 310891,
                500411, 288534, 779088),
  pct_year_occupation = c(0.75, 0.8, 0.6, 0.7, 0.7, 0.5, 0.55, 0.7, 0.6, 0.7)
)

investment_analysis <- data.frame()

for (neighborhood in avg_prices$neighbourhood) {
  # Get the average price per property
  property_price <- avg_prices %>%
    filter(neighbourhood == neighborhood) %>%
    pull(avg_price)

  # Skip if price is missing
  if (length(property_price) == 0 || is.na(property_price)) next

  # Calculate the maximum number of properties we can buy
  max_properties <- floor(budget / property_price)

  # Create multiple property samples and average the predictions for more accuracy
  num_samples <- 10
  predicted_prices <- numeric(num_samples)

  for (i in 1:num_samples) {
    # Generate property listing for prediction with standardized features
    property_listing <- create_sample(
      df = airbnb_clean,
      df_prices = avg_prices,
      origin_sample = sample(1:nrow(airbnb_clean), size = 1, replace = FALSE),
      neighbourhood_new = neighborhood,
      property_type_new = "apartment",
      room_type_new = "entire home/apt",
      accommodates_new = 4,
      bathrooms_new = 1,
      bedrooms_new = 2,
      beds_new = 2,
      cleaning_fee_new = 20,
      host_response_time_new = "within an hour",
      host_response_rate_new = 100,
      review_scores_rating_new = 100,
      review_scores_accuracy_new = 10,
      review_scores_cleanliness_new = 10,
      review_scores_checkin_new = 10,
      review_scores_communication_new = 10,
      review_scores_location_new = 10,
      review_scores_value_new = 10,
      minimum_nights_new = 1,
      maximum_nights_new = 30
    )

    # Ensure latitude and longitude columns exist in the property_listing
    if (!all(c("latitude", "longitude") %in% names(property_listing))) {
      # If missing, add them from the original sample
      origin_data <- airbnb_clean[sample(1:nrow(airbnb_clean), size = 1, replace = FALSE), ]
      if (!"latitude" %in% names(property_listing)) {
        property_listing$latitude <- origin_data$latitude
      }
      if (!"longitude" %in% names(property_listing)) {
        property_listing$longitude <- origin_data$longitude
      }
    }

    # Predict nightly price using airbnb_predictor and apply exponentiation
    tryCatch({
      predicted_prices[i] <- predict(airbnb_predictor, property_listing) %>%
        pull(.pred) %>%
        exp()  # Convert from log back to actual prices
    }, error = function(e) {
      # If prediction fails, use a fallback method
      cat("Prediction error:", e$message, "\n")
      # Fallback: average neighborhood price with a small random adjustment
      predicted_prices[i] <<- mean(
        df_with_predictions$price[df_with_predictions$neighbourhood_group_cleansed == neighborhood],
        na.rm = TRUE
      ) * (1 + rnorm(1, 0, 0.05))
    })
  }

  # Use the median prediction to reduce the impact of outliers
  predicted_price <- median(predicted_prices, na.rm = TRUE)

  # Get the estimated occupancy rate for the neighborhood
  occupancy_rate <- avg_prices %>%
    filter(neighbourhood == neighborhood) %>%
    pull(pct_year_occupation)

  # Skip if occupancy rate is missing
  if (length(occupancy_rate) == 0 || is.na(occupancy_rate)) next

  # Calculate expected annual revenue (assuming 365 days)
  annual_revenue_per_property <- predicted_price * occupancy_rate * 365

  # Calculate annual expenses (estimate 30% of revenue for maintenance, taxes, utilities)
  annual_expenses_per_property <- annual_revenue_per_property * 0.3

  # Calculate net annual revenue
  net_annual_revenue_per_property <- annual_revenue_per_property - annual_expenses_per_property

  # Total net revenue considering the number of properties
  total_net_annual_revenue <- net_annual_revenue_per_property * max_properties

  # Calculate payback period in years
  payback_period <- (property_price * max_properties) / total_net_annual_revenue

  # Calculate ROI (Return on Investment) percentage
  roi_percentage <- (net_annual_revenue_per_property / property_price) * 100

  # Store results
  investment_analysis <- bind_rows(
    investment_analysis,
    tibble(
      neighbourhood = neighborhood,
      property_price = property_price,
      max_properties = max_properties,
      predicted_nightly_price = predicted_price,
      occupancy_rate = occupancy_rate,
      annual_revenue_per_property = annual_revenue_per_property,
      net_annual_revenue_per_property = net_annual_revenue_per_property,
      total_investment = property_price * max_properties,
      total_net_annual_revenue = total_net_annual_revenue,
      payback_period_years = payback_period,
      roi_percentage = roi_percentage
    )
  )
}

# Sort by shortest payback period (best investments first)
investment_analysis <- investment_analysis %>%
  arrange(payback_period_years)

# Print the final investment strategy
print("Optimal real estate investment strategy:")
# Visualize the results
library(ggplot2)

# Plot ROI by neighborhood
ggplot(investment_analysis, aes(x = reorder(neighbourhood, roi_percentage), y = roi_percentage)) +
  geom_bar(stat = "identity", fill = "steelblue") +
  coord_flip() +
  labs(
    title = "Return on Investment by Neighborhood",
    x = "Neighborhood",
    y = "ROI (%)"
  ) +
  theme_minimal()
# Plot payback period by neighborhood
ggplot(investment_analysis, aes(x = reorder(neighbourhood, -payback_period_years), y = payback_period_years)) +
  geom_bar(stat = "identity", fill = "coral") +
  coord_flip() +
  labs(
    title = "Payback Period by Neighborhood",
    x = "Neighborhood",
    y = "Years to Recover Investment"
  ) +
  theme_minimal()
Investment Scenarios with a €3M Budget
Based on the table, we can develop two different investment strategies:
• Risky Scenario (High Potential Returns, More Diversification)
• Less Risky Scenario (Stable Returns, Focused Investments)
1. Risky Investment Scenario (Diversified, High Potential Returns)
• Higher risk but higher potential upside due to the diverse mix of high-ROI areas (Sants-Montjuïc, Nou Barris) and long-term appreciation neighborhoods (Eixample, Ciutat Vella).
• Spreads risk across different areas, meaning if some neighborhoods underperform, others may compensate.
• Includes Eixample & Ciutat Vella, which have higher property prices but strong Airbnb demand, leading to potential future price appreciation.
⸻
2. Less Risky Investment Scenario (Focused on Stable Returns)
This approach prioritizes stability and predictable cash flows, concentrating the investment in high ROI, low-risk neighborhoods.
• Lower risk, faster payback period (~14-15 years on average).
• Sants-Montjuïc and Nou Barris have the best combination of high ROI and short payback periods.
• Avoids high-cost, low-ROI neighborhoods (Sarria-Sant Gervasi, Les Corts, Gracia).
• More predictable and stable returns from Airbnb rentals.
Final Conclusion:
Analyzing the options, we think the risky scenario is the best approach because it offers strong potential returns. Furthermore, diversifying our investment across multiple neighborhoods reduces the risk of relying too heavily on a single market. By spreading our €3M budget across high-growth areas, we can maximize exposure to different demand dynamics while maintaining a good ROI.
Assuming we are young and rich (with a hypothetical €50M net worth), we can afford to take more calculated risks in pursuit of strategic, high-reward returns. We believe this decision positions us for long-term dominance in the Barcelona Airbnb and real estate market.